AFOSR-TR-85-0368


FAULT-TOLERANT COMPUTING RESEARCH,
INTERIM SCIENTIFIC REPORT, GRANT AFOSR-84-0052,
15 JANUARY 1984 - 14 JANUARY 1985


Professor D.K. Pradhan
Department  of  Electrical  and 
Computer  Engineering 
University  of  Massachusetts 
Amherst  MA  01003 


March  4,  1985 




Approved for public release;
distribution unlimited.




UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE

REPORT DOCUMENTATION PAGE

1a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
1b. RESTRICTIVE MARKINGS:
2a. SECURITY CLASSIFICATION AUTHORITY:
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE:
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release;
   distribution unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S):
5. MONITORING ORGANIZATION REPORT NUMBER(S): AFOSR-TR-85-0368
6a. NAME OF PERFORMING ORGANIZATION: University of Massachusetts
6b. OFFICE SYMBOL (If applicable):
6c. ADDRESS (City, State and ZIP Code): Department of Electrical and
    Computer Engineering, Amherst MA 01003
7a. NAME OF MONITORING ORGANIZATION: Air Force Office of Scientific
    Research
7b. ADDRESS (City, State and ZIP Code): Directorate of Mathematical &
    Information Sciences, Bolling AFB DC 20332-6448
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: AFOSR
8b. OFFICE SYMBOL (If applicable):
8c. ADDRESS (City, State and ZIP Code): Bolling AFB DC 20332-6448
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: AFOSR-84-0052
10. SOURCE OF FUNDING NOS.:
    PROGRAM ELEMENT NO.: 61102F
    PROJECT NO.: 2304
    TASK NO.: A6
    WORK UNIT NO.:
11. TITLE (Include Security Classification): FAULT-TOLERANT COMPUTING
    RESEARCH.
12. PERSONAL AUTHOR(S): D.K. Pradhan
13a. TYPE OF REPORT: Interim
13b. TIME COVERED: FROM
14. DATE OF REPORT (Yr., Mo., Day): 4 MAR 85
15. PAGE COUNT: 20
16. SUPPLEMENTARY NOTATION:
17. COSATI CODES:
18. SUBJECT TERMS (Continue on reverse if necessary and identify by
    block number): Fault-tolerant computing.

19. ABSTRACT (Continue on reverse if necessary and identify by block number)

This report provides a synopsis of research performed in fault-tolerant computing for the
first year of grant AFOSR-84-0052. Also included is a list of publications that have
resulted from the research supported by this grant. Additionally, this report reviews the
future direction for the continuing research under this grant.

In the past year, this effort has focused on the following problems: (1) Investigation of
novel fault-tolerant processor array architectures with the potential of a high degree of
defect tolerance, but having low processor and interconnect overhead associated with the
fault tolerance mechanisms; (2) Development of realistic models to evaluate the yield,
redundancy and performance tradeoffs for the designs. Such models would help establish the
viability of these architectures, also enabling them to be compared with other designs in
the literature; (3) Development of new and efficient testing strategies, and reconfiguration
schemes for their structures; (4) Testable design of large size VLSI memory; and (CONTINUED)


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT:
    UNCLASSIFIED/UNLIMITED ☒  SAME AS RPT. □  DTIC USERS □
21. ABSTRACT SECURITY CLASSIFICATION: UNCLASSIFIED
22a. NAME OF RESPONSIBLE INDIVIDUAL: CPT John P. Thomas, Jr.
22b. TELEPHONE NUMBER (Include Area Code): (202) 767-5026
22c. OFFICE SYMBOL: NM

DD FORM 1473, 83 APR. EDITION OF 1 JAN 73 IS OBSOLETE.


TABLE OF CONTENTS

I. Introduction

II. Summary of Research Results

III. Publications Supported by AFOSR 84-0052

IV. Synopsis of Future Research

V. Biography and Vita of PI


AIR FORCE OFFICE OF SCIENTIFIC RESEARCH (AFSC)
NOTICE OF TRANSMITTAL TO DTIC
This technical report has been reviewed and is
approved for public release IAW AFR 190-12.
Distribution is unlimited.

MATTHEW J. KERPER
Chief, Technical Information Division


I.  INTRODUCTION 


This report details the research accomplishments of the first year of
AFOSR support under grant 84-0052. Research focused primarily on the
fault-tolerance aspects of large area VLSI circuits, particularly in the
context of multiprocessor implementation on a single chip or wafer.
Additional research was carried out on the design of easily testable
memory circuits and on the design of sorting networks on a single chip.
The research performed has been recognized in the professional realm, as
evidenced by the list of accepted/published papers.

The report that follows is organized into three main sections. Section
II summarizes the key research results obtained to date. Section III lists
all of the papers and reports that have either already been accepted for
publication, or that have been submitted for publication. Section IV closes
the report by discussing future directions indicated for the continuing
research.


II.  SUMMARY  OF  RESEARCH  RESULTS 


Last  year's  research  focused  on  the  following  problems: 

2.1 Investigation of novel fault-tolerant processor array architectures
with the potential of a high degree of defect tolerance, but
having low processor and interconnect overhead associated with the
fault tolerance mechanisms.

2.2  Development  of  realistic  models  to  evaluate  the  yield,  redundancy 
and  performance  tradeoffs  for  our  designs.  Such  models  would  help 
establish  the  viability  of  these  architectures,  also  enabling  them 
to  be  compared  with  other  designs  in  the  literature. 

2.3 Development of new and efficient testing strategies, and
reconfiguration schemes for our structures.

2.4  Testable  design  of  large  size  VLSI  memory. 

2.5  Development  of  novel  sorting  networks  that  can  be  implemented  on  a 
single  chip  or  wafer. 

Highly parallel algorithms that solve many important computational
problems have been known for several years. Regrettably, the large parallel
processor arrays that are necessary to exploit the parallelism in these
algorithms are expensive to implement; therefore, they have not been widely
utilized. The proposed research has as its goal the development of design
techniques that will allow such large high performance arrays to be
implemented on a single large area wafer scale integrated circuit. This would
make it feasible to use such processor arrays in relatively small,
application-oriented systems; examples include on-board image analysis
systems in remote vehicles, quick response robot control systems, etc.

As component sizes in VLSI approach the submicron level, increased chip
complexity through smaller feature sizes appears more difficult to achieve.
It is therefore clearly desirable to realize large area VLSI circuits.
Unfortunately, any significant increase in chip area, including full wafer
integration, remains an elusive goal primarily because of the large number
of fabrication defects which appear even in the best of VLSI manufacturing
processes. It is clear that such large area VLSI circuits, in order to be
viable, must be designed so as to be "defect tolerant"; i.e., they must
operate correctly even in the presence of fabrication defects. However,
traditional fault tolerant design approaches cannot be directly applied to
this problem; also, the few defect tolerance schemes recently proposed in
the literature are either limited in their applicability to memory circuits,
or have other significant shortcomings.

Here,  we  address  this  important  problem  and  these  related  issues: 
developing  yield  models  for  evaluating  the  effectiveness  of  the  proposed 
fault  tolerant  designs;  additionally,  developing  efficient  testing 
strategies  for  these  complex  circuits. 

Two specific topic areas which are related to such VLSI designs are the
focus of our research; these are the areas of yield enhancement and
performance improvement. Analytical models are being developed that evaluate how
yield enhancement and performance improvement may both be achieved with the
introduction of redundancy into the VLSI design.
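
As a rough illustration of how redundancy can raise yield, the following
sketch computes the probability that an array still has enough working
processors after fabrication. It is not the model developed under this
grant. It assumes that each processor is defect-free independently with
probability p_good, that any spare can replace any faulty processor, and
that switches and interconnect are perfect. The names array_yield,
required, spares and p_good are hypothetical.

```python
from math import comb


def array_yield(required: int, spares: int, p_good: float) -> float:
    """Probability that at least `required` of `required + spares`
    processors are defect-free.

    Assumptions (not from the report): independent defects, ideal
    reconfiguration, defect-free interconnect.
    """
    total = required + spares
    return sum(
        comb(total, k) * p_good**k * (1.0 - p_good) ** (total - k)
        for k in range(required, total + 1)
    )


# A 64-processor array with and without 8 spare processors.
no_spares = array_yield(64, 0, 0.95)
with_spares = array_yield(64, 8, 0.95)
```

Under these assumptions the array without spares works only if all 64
processors are good, so its yield is 0.95 to the 64th power, while a few
spares recover most of the lost yield. A realistic model would also have
to charge the area of the spares and switches against the defect rate.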

Also developed is a taxonomy for fault-tolerant multiprocessor
architectures on large area VLSI circuits. Such a taxonomy allows us to study
strengths/weaknesses of various ad-hoc schemes that have been proposed. At
the same time, we can develop new interconnect structures that utilize VLSI
area more efficiently.

Also,  we  are  carrying  out  the  following  research  on  system-level  issues 
of  fault-diagnosis  in  the  context  of  multiprocessor  implementation  on  a 
single  chip  or  wafer. 


Most previous research has considered either: (1) the diagnosability
of a system with a predetermined static testing graph or (2) adaptive
testing graphs (where one test is conducted at a time, its result determining
the next test). Our approach is to determine a minimal testing graph (as
measured by the number of edges) that may be applied to diagnose at least
one fault. The distinction between our approach and earlier work is that
the tests are neither conducted sequentially (as in adaptive methods) since
the graph is known, nor is the graph static. Instead, after a fault has
been diagnosed, a new minimal graph is used to diagnose subsequent faults.
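
The idea of diagnosing one fault with a small testing graph can be
sketched as follows. This is only an illustration under stated
assumptions, not the procedure developed under this grant: the testing
graph is taken to be a directed ring (unit i tests unit i+1, so n tests
in all), at most one unit is faulty, and test outcomes follow the
classical Preparata-Metze-Chien (PMC) assumption that a fault-free tester
always reports correctly while a faulty tester may report anything. The
function names are hypothetical.

```python
import random


def ring_syndrome(n, faulty, rng=random):
    """Outcome a[i] of unit i testing unit (i + 1) % n; 1 means 'fails'.

    Fault-free testers report the truth; a faulty tester's report is
    arbitrary (PMC assumption, not stated in the report).
    """
    a = []
    for i in range(n):
        target = (i + 1) % n
        if i in faulty:
            a.append(rng.randint(0, 1))
        else:
            a.append(1 if target in faulty else 0)
    return a


def diagnose_one(a):
    """Return the single faulty unit in a ring of n >= 3 units, or None.

    The faulty unit j is the one that fails its test while its tester
    passed its own test: a[j - 1] == 1 and a[j - 2] == 0.
    """
    n = len(a)
    for j in range(n):
        if a[(j - 1) % n] == 1 and a[(j - 2) % n] == 0:
            return j
    return None


located = diagnose_one(ring_syndrome(8, {5}))
```

All n tests can be run simultaneously and analyzed together. Once the
faulty unit is located, a new ring over the remaining units could play
the role of the "new minimal graph" for the next fault.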

We adopt a graph-theoretic model of a distributed computing system,
where graph $G = (V, E)$. The vertices in $V$ represent processors in the
system; the edges in $E$ represent communication links between processors.
The edges in $E$ are labelled $(a, b)$, where $a$ and $b$ are labels for
vertices in $V$. Let there be $n$ nodes in $G$, $n = |V|$. The degree, $d_i$,
of a node, $i$, is the number of nodes to which it is directly linked. The
degree, $d$, of $G$ is the maximum of $d_i$ over all $i$ in $V$. The distance
between two nodes is the minimum number of edges that must be traversed to
travel between them. The diameter, $K$, of $G$ is the maximum of the
distances between all possible pairs of nodes. The $f$-fault diameter, $K_f$,
is the maximum of the diameters of all graphs obtainable from $G$ by removing
any $f$ nodes. The connectivity, $c$, of $G$ is the minimum number of nodes
that must be removed in order to disconnect $G$ ($K_c = \infty$), or reduce
it to a solitary node. ($G$ can tolerate $t = c - 1$ faults without risking
disconnection.)
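
The definitions above can be checked on small graphs by brute force. The
sketch below is only an illustration of the definitions, with hypothetical
function names; it enumerates every set of removed nodes and is therefore
practical only for small n. The graph is assumed to be given as a
dictionary mapping each node to its neighbors.

```python
from collections import deque
from itertools import combinations

INF = float("inf")


def distances_from(adj, src, removed=frozenset()):
    """Breadth-first hop counts from src, skipping nodes in `removed`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in removed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree(adj):
    """Degree d of G: the maximum node degree d_i."""
    return max(len(nbrs) for nbrs in adj.values())


def diameter(adj, removed=frozenset()):
    """Diameter K after deleting `removed`; infinite if disconnected."""
    alive = [v for v in adj if v not in removed]
    worst = 0
    for s in alive:
        dist = distances_from(adj, s, removed)
        if len(dist) < len(alive):
            return INF
        worst = max(worst, max(dist.values()))
    return worst


def fault_diameter(adj, f):
    """K_f: the largest diameter over all ways of removing f nodes."""
    return max(diameter(adj, frozenset(c)) for c in combinations(adj, f))


def connectivity(adj):
    """Smallest c with K_c infinite, or that leaves a solitary node."""
    n = len(adj)
    for c in range(n):
        if c == n - 1 or fault_diameter(adj, c) == INF:
            return c


# Example: a ring of six processors.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

For the six-node ring, the degree is 2, the diameter is 3, the 1-fault
diameter is 4 (one removal leaves a five-node path), and the connectivity
is 2, so the ring tolerates t = 1 fault without risking disconnection.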

Thus, previous research derived the conditions determining precisely
when a given set of tests in a homogeneous system achieved a specified level
of self-diagnosability. A new methodology is pursued here with the
objective of minimizing the overhead associated with periodic testing.

Specifically, decreasing the testing required from $O(nt)$ tests to $O(n)$
tests would improve the performance of the system. The savings could be
distributed in any way desired amongst these three factors:

(1) testing overhead. Some of the system time devoted to testing could
be recovered for useful work.

(2) test reliability. The fewer tests could be allotted more time,
likely making them more thorough.

(3) test frequency. The fewer tests could be conducted more
frequently, yielding a better average time between component failure and
detection.
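
To give a feel for the size of the saving, the sketch below counts tests
in two testing graphs. It assumes, as in the classical static approach,
that every unit must be tested by t other units, which gives nt tests; the
particular construction (unit i tests units i+1 through i+t) and the
function names are illustrative assumptions, not taken from this report.

```python
def static_testing_graph(n, t):
    """Unit i tests units i+1, ..., i+t (mod n), so that every unit is
    tested by exactly t others. Returns (tester, tested) pairs."""
    return [(i, (i + k) % n) for i in range(n) for k in range(1, t + 1)]


def ring_testing_graph(n):
    """One test per unit: unit i tests only unit i+1 (mod n)."""
    return static_testing_graph(n, 1)


n, t = 64, 5
static_tests = len(static_testing_graph(n, t))  # n * t tests
ring_tests = len(ring_testing_graph(n))         # n tests
```

With n = 64 and t = 5, the static graph needs 320 tests per testing round
and the ring needs 64. The factor of t saved can then be spent on any of
the three items listed above.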

Diagnosis  must  be  considered  in  both  synchronous  and  asynchronous 
environments. A synchronous environment is usually achieved by message
passing; the processing elements operate as though with a common clock. A
synchronous  environment  enjoys  the  advantage  of  allowing  the  processors  to 
conduct  their  tests  simultaneously.  This  feature  permits  diagnosis  by  an 
analysis  of  the  set  of  test  results. 

In  summary,  we  pursue  a  strategy  of  not  utilizing  the  full  capacity  of 
the  allowable  testing  graph  in  an  effort  to  arrive  at  a  more  efficient 
diagnosis. 

Also being investigated is a new design of easily testable memory. The
impact of VLSI is nowhere more dramatic than in the area of Random Access
Memory (RAM) design. The very marked improvement in RAM density has chiefly
resulted from two factors: firstly, the improvement in fabrication
technology has made way for a significant decrease in minimum feature size.
Secondly, the evolution of the storage cell within the RAM itself has seen a
significant decrease in size, from the initial 6 transistor static
cell to the 1 transistor dynamic RAM cell.


Design improvements in the RAM have also brought on corresponding,
significant problems, as described below.

Testing  Complexity 

Dynamic single transistor cells permit very high integration densities
and will probably be used in all future generations of memories. However,
these cells are susceptible to charge leakage and alpha particle
sensitivity. Charge leakage is a complex phenomenon and in general is a
function of the state of the neighboring cells, giving rise to pattern
sensitivity. Also, the proximity of the cells has given rise to crosstalk.
These soft errors, together with the usual open, short and stuck-at faults,
make memory testing a complex problem.

Yield

Although this problem is not specific to memories, it is a major
obstacle towards integrating larger memories on a chip. Since feature sizes
will shrink more slowly, larger memories can only be obtained by increasing
the die area, and yield decreases exponentially with increasing area.
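
The exponential dependence of yield on area can be illustrated with the
simple Poisson yield model, Y = exp(-A * D), where A is the die area and D
the average defect density. The report does not say which yield model it
has in mind, so the choice of model, the function name and the numbers
below are assumptions made purely for illustration.

```python
from math import exp


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies when defects fall independently and
    uniformly at the given average density (Poisson model, an assumption
    not taken from the report)."""
    return exp(-area_cm2 * defects_per_cm2)


# Doubling the die area squares the yield under this model.
density = 1.0  # hypothetical defects per square centimetre
small_die = poisson_yield(0.5, density)   # about 0.61
medium_die = poisson_yield(1.0, density)  # about 0.37
large_die = poisson_yield(2.0, density)   # about 0.14
```

Under this model every doubling of die area squares the yield, which is
why a larger memory obtained purely by enlarging the die quickly becomes
uneconomical without on-chip redundancy.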

Graceful Degradation

Memories have always been small, low-cost units; so until now, graceful
degradation has not been an issue. However, as memory sizes move to the
megabit range, each chip would represent a considerable percentage of the
entire memory system. This would make the system very susceptible to single
point failures. For example, if a one megabyte memory system is to be
designed using 64K by 1 bit chips, then the system would be organized as 16
banks of 8 chips each. If a single chip fails, then that bank can be
isolated and the system can continue with reduced memory. If, however, 1M by
1 bit memory chips are used, then a single failure would cause the loss of
the entire system. Even with the use of error correcting codes, the ability
to degrade gracefully would warrant another layer of fault tolerance to the
system. It is projected that this will be a requirement in future designs.

In order to address these just-described problems, a brand new RAM
architecture is being developed here, with the following properties:

(a) Provide redundancy at different levels to improve fault-tolerance
and yield.

(b) Provide easily testable properties that reduce the test
complexity. The proposed design has the potential for keeping the
testing time constant with the increase in the size of the RAM.

(c)  Provide  graceful  degradation  for  operational  faults. 
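
The bank arithmetic in the graceful degradation example can be reproduced
with a few lines. The sketch assumes a byte-wide (8-bit) memory word built
from 1-bit-wide chips, as in the example; the function and variable names
are hypothetical.

```python
K = 1 << 10
M = 1 << 20


def memory_organization(system_bytes, chip_words, chip_width_bits=1,
                        word_bits=8):
    """Return (banks, chips_per_bank) for a memory built from
    chip_words x chip_width_bits chips, one bank per chip_words bytes."""
    chips_per_bank = word_bits // chip_width_bits
    banks = system_bytes // chip_words
    return banks, chips_per_bank


def fraction_lost_per_chip_failure(banks):
    """Share of the memory lost when the bank holding a failed chip is
    isolated."""
    return 1.0 / banks


small_chips = memory_organization(1 * M, 64 * K)  # (16, 8)
large_chips = memory_organization(1 * M, 1 * M)   # (1, 8)
```

With 64K by 1 bit chips, isolating one bank costs 1/16 of the memory. With
1M by 1 bit chips there is a single bank, so one chip failure removes the
whole memory unless the chip itself can degrade gracefully.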

Already, significant progress to this end has been made and an actual
prototype is being built, using the MOSIS facility. Finally, work is being
carried out on de Bruijn multiprocessor networks. Specifically, we have
derived results which use de Bruijn graphs to design a versatile sorting
network.
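
The report does not define the de Bruijn graph it uses. As a minimal
sketch, assuming the usual binary de Bruijn graph on 2^k nodes in which
node x is linked to nodes 2x mod 2^k and (2x + 1) mod 2^k, and treating
every link as bidirectional with self-loops dropped, the interconnection
can be built as follows. The function name is hypothetical, and the
sorting algorithms themselves are not reproduced here.

```python
def undirected_de_bruijn(k):
    """Adjacency sets of the undirected binary de Bruijn graph on 2**k
    nodes (self-loops and duplicate links removed)."""
    n = 1 << k
    adj = {x: set() for x in range(n)}
    for x in range(n):
        for y in ((2 * x) % n, (2 * x + 1) % n):
            if x != y:
                adj[x].add(y)
                adj[y].add(x)
    return adj


network = undirected_de_bruijn(3)  # 8 processors
max_degree = max(len(nbrs) for nbrs in network.values())
```

Each processor has at most four links regardless of network size, which
keeps the interconnect overhead per processor fixed as the array grows.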

Recent work has classified sorting architectures as (A) Sequential
input/Sequential output, (B) Parallel input/Sequential output, (C) Parallel
input/Parallel output, (D) Sequential input/Parallel output and (E) Hybrid
input/Hybrid output. The classification is based not only on the I/O
method, but also on the sorting algorithm, as well as on the type of keys
used. We have demonstrated that the architectures based on the undirected
de Bruijn graphs (DGs) can sort data items in all of the above-mentioned
categories. To the best of our knowledge, no other single network which can
sort data items in all the categories is known. Sorting algorithms and time
complexities that correspond to each of these categories are derived here.
The algorithms are distributed in the sense that these are executed by
individual processors without any centralized controller. It is shown that
these architectures can achieve the previously known best upper bound times,
in all of the categories. Also, it is shown that they work as sorting
networks, even in the presence of some faults.
Journals
Proceedings of the 12th Annual International Symposium on Computer
Architecture, June 1985 (to appear), (with M.R. Samatham).
Schlumberger).
Arbor, MI, June 1985 (with Fred Meyer).


IV.  SYNOPSIS  OF  FUTURE  RESEARCH 

The goal of our research here is to develop area efficient and testable
fault tolerant VLSI structures, and to investigate the feasibility and cost
effectiveness of implementing them on a single large area (including wafer
scale) integrated circuit. Towards this goal, we are undertaking the
following research tasks.

1. We plan to develop new fault-tolerant architectures that will
provide more efficient use of redundancy for yield and performance
improvement. A broad class of existing networks will also be studied to
determine techniques to incorporate fault-tolerance in these structures. We
shall also develop a unified framework through which diverse fault-tolerance
issues such as performance improvement and testability improvement can be
studied.

2. Models for evaluating redundant VLSI structures will be
developed. Our models will have wide applicability and will thus allow us
to compare different designs. They will also be detailed enough to
meaningfully predict fabrication yields. Furthermore, since it is useful to find
methods by which one can optimally share available on-chip redundancy
between yield enhancement and performance improvement, we also plan to develop
a model that can be used to study the effect of sharing available
redundancy between these two somewhat competing requirements. We believe
that no such models yet exist.

3. Several problems related to testing and reconfiguration of these
arrays will be studied. Our approach differs from the existing approaches
to multiprocessor diagnosis in that it is tailored specifically to the
constraints posed by VLSI processor arrays. Both the distributed and
centralized modes of testing will be considered.

4. To help establish the feasibility of some of the array
structures, we will develop models that will allow realistic evaluation of their
complexity. Also, we propose to lay out and implement parts of the proposed
array structures using the VLSI CAD tools available at the University and the
MOSIS facility. Some of the simpler array elements, such as switch designs,
can be suggested as class projects in the two semester VLSI design course
sequence taught at the University.

5. Also, we plan to continue our research in the areas of
RAM design and sorting networks.

The  ultimate  goal  of  our  research  is  the  full  development  of  various 
aspects  of  fault-tolerant  large  area  VLSI  design. 

